
feat(eval): Add skill evaluation framework#5

Merged
saadshahd merged 2 commits into main from feat/skill-evaluations
Dec 16, 2025

Conversation

@saadshahd
Owner

Summary

  • CI workflow runs skill auto-triggering tests on PRs touching plugin code
  • Uses claude-code-action with --plugin-dir to load plugins from checked-out repo
  • Local testing via ./eval/run.sh --simple
  • 5 test cases covering hope:gate, hope:soul, hope:trace, product, wordsmith

Test plan

  • CI workflow triggers and completes
  • At least one test returns VERDICT: PASS
  • Local ./eval/run.sh --simple hope-gate-completion works

- CI workflow runs on PRs touching plugin code
- Uses claude-code-action with --plugin-dir to load plugins
- Local testing via ./eval/run.sh --simple
- 5 test cases for skill auto-triggering
- JSON schema for structured output (future API key support)
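For illustration, a test case might look roughly like this. Only prompt and expected_behaviors are confirmed to be read by the workflow; the name and the prompt text here are hypothetical:

```yaml
# Hypothetical shape of a case under eval/cases/skill-triggers/
name: hope-gate-completion
prompt: "I think this task is done, can you double-check before I ship it?"
expected_behaviors:
  - "hope:gate skill activates"
  - "Response verifies completion criteria before approving"
```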
@greptile-apps

greptile-apps bot commented Dec 16, 2025

Greptile Overview

Greptile Summary

Adds comprehensive skill evaluation framework using claude-code-action for automated testing of skill auto-triggering.

  • CI workflow runs 5 test cases in parallel matrix strategy on PRs touching plugin code
  • Local testing via ./eval/run.sh --simple with OAuth support
  • Well-structured test cases for hope:gate, hope:soul, hope:trace, product, and wordsmith skills
  • Claude self-evaluates against expected behaviors, enforcing structured output via JSON schema
  • Clear documentation for both developers (docs/dev/evaluations.md) and users (eval/README.md)
  • CHANGELOG.md properly updated following project conventions

Confidence Score: 4/5

  • This PR is safe to merge with low risk - adds testing infrastructure without touching production code
  • Score reflects well-designed evaluation framework with good documentation and proper error handling. Minor consideration around PARTIAL verdict handling in CI (currently passes but could be stricter). All test cases follow consistent structure, shell script is clean, and changes align with CLAUDE.md conventions including proper CHANGELOG entry.
  • No files require special attention - all files are well-structured with appropriate documentation

Important Files Changed

File Analysis

Filename Score Overview
.github/workflows/eval.yml 4/5 Adds CI workflow for skill auto-triggering tests using claude-code-action with matrix strategy for parallel test execution
eval/run.sh 5/5 Local test runner with --simple flag for OAuth compatibility and help documentation
eval/schema.json 5/5 JSON schema for structured evaluation output with verdict, behaviors, and evidence tracking
docs/dev/evaluations.md 5/5 Developer documentation explaining evaluation philosophy, architecture, and rationale for tool choices

Sequence Diagram

sequenceDiagram
    participant PR as Pull Request
    participant GH as GitHub Actions
    participant CCA as claude-code-action
    participant Claude as Claude Code
    participant Plugins as Plugin Skills
    participant YAML as Test Case YAML

    PR->>GH: Trigger on plugin/** or eval/** changes
    GH->>GH: Matrix strategy: spawn 5 parallel jobs
    
    loop For each test case
        GH->>CCA: Run with OAuth token
        CCA->>Claude: Initialize with --plugin-dir flags
        Claude->>Plugins: Load hope, product, wordsmith, founder, career
        
        CCA->>Claude: Send eval prompt
        Claude->>YAML: Read test case file
        YAML-->>Claude: Extract prompt + expected_behaviors
        
        Claude->>Claude: Process prompt (skills auto-trigger)
        Claude->>Plugins: Skills evaluate trigger conditions
        Plugins-->>Claude: Skill activates (or doesn't)
        
        Claude->>Claude: Self-evaluate response vs expected_behaviors
        Claude->>Claude: Generate verdict (PASS/PARTIAL/FAIL)
        Claude-->>CCA: Return result with verdict
        
        CCA-->>GH: Output result
        GH->>GH: Parse verdict with grep
        
        alt VERDICT: PASS
            GH->>GH: Print success ✓
        else VERDICT: PARTIAL
            GH->>GH: Print warning ⚠
        else VERDICT: FAIL or no verdict
            GH->>GH: Fail job with error
        end
    end
    
    GH->>PR: Report CI status


@greptile-apps greptile-apps bot left a comment


12 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +72 to +73
elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
echo "⚠ $TEST_NAME: PARTIAL"

style: VERDICT: PARTIAL exits with success but may indicate issues. Consider if partial verdicts should fail CI to maintain quality gates.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


@claude

claude bot commented Dec 16, 2025

PR Review: Skill Evaluation Framework

Thank you for this well-structured addition! The evaluation framework is a solid approach to ensuring skill quality.

Strengths

Architecture & Design

  • Self-evaluation approach is elegant and aligns well with Claude Code philosophy
  • Matrix strategy in CI allows parallel test execution - good for performance
  • fail-fast: false ensures all tests run even if one fails - excellent for visibility
  • Clear separation between simple mode (OAuth) and full mode (API key) shows good UX thinking
  • Test case schema (YAML) is well-structured and extensible

Documentation

  • Excellent docs in eval/README.md and docs/dev/evaluations.md
  • Clear rationale for why NOT using promptfoo/SDK (docs/dev/evaluations.md:96-106)
  • CHANGELOG.md properly updated
  • CLAUDE.md updated with evaluation instructions

Code Quality

  • Shell script uses set -euo pipefail - excellent error handling
  • Proper use of SCRIPT_DIR for path resolution
  • Help function is clear and useful
  • Good use of 2>/dev/null to suppress errors

Issues & Recommendations

1. CRITICAL: Test Case Date Inconsistency

Location: All test YAML files (eval/cases/skill-triggers/*.yaml:26)

All test files show created: 2024-12-15 but the current date is 2025-12-16. This appears to be a typo.

Fix: Update to 2025-12-15 in all 5 test files.

2. CI Workflow: Missing Plugin Directory Check

Location: .github/workflows/eval.yml:56

The workflow references --plugin-dir ./founder --plugin-dir ./career but there is no validation that plugin directories are valid.

Recommendation: Add a validation step before running tests.
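A minimal validation step might look like this (the check is a sketch; plugin directory names would come from the workflow's --plugin-dir flags):

```shell
# Sketch: verify each plugin directory exists before the test matrix runs
validate_plugins() {
  local missing=0 dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "Error: plugin directory '$dir' not found" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In the workflow this could run as a step before claude-code-action, e.g. `validate_plugins hope product wordsmith founder career`.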

3. Shell Script: Partial Test Name Matching

Location: eval/run.sh:49

Documentation shows ./eval/run.sh --simple hope-gate but the actual test is hope-gate-completion.yaml. This will fail.

Fix: Either update docs to use full names, OR add fuzzy matching for partial test names.
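If fuzzy matching is the chosen fix, a sketch of the lookup could be (directory layout and exact semantics are illustrative):

```shell
# Sketch: resolve a possibly-partial test name to a case file
resolve_case() {
  local name="$1" dir="$2" match
  if [ -f "$dir/$name.yaml" ]; then          # exact match wins
    echo "$dir/$name.yaml"
    return 0
  fi
  # Otherwise take the first file whose name contains the query
  match=$(find "$dir" -maxdepth 1 -name "*$name*.yaml" 2>/dev/null | head -n 1)
  if [ -n "$match" ]; then
    echo "$match"
    return 0
  fi
  return 1
}
```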

4. Schema: reasoning Should Be Required

Location: eval/schema.json:45

The reasoning field is optional, but the CI prompt says explain your reasoning briefly. This could cause validation failures.

Fix: Add reasoning to required fields array.
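The fix amounts to listing reasoning in the required array. A trimmed sketch (property shapes other than verdict and reasoning are assumptions, not the actual eval/schema.json contents):

```json
{
  "type": "object",
  "properties": {
    "verdict": { "enum": ["PASS", "PARTIAL", "FAIL"] },
    "reasoning": { "type": "string" }
  },
  "required": ["verdict", "reasoning"]
}
```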

5. Test Coverage: Missing Negative Cases

Location: eval/cases/skill-triggers/

All 5 test cases validate positive triggers. No tests for skills that should NOT trigger on certain prompts.

Recommendation: Add negative test cases to ensure skills do not over-trigger.
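A negative case could mirror the positive ones, with the expectation inverted. A hypothetical sketch (field names follow the existing cases; the prompt is invented):

```yaml
# Hypothetical negative case: the skill should NOT trigger
name: hope-gate-no-trigger
prompt: "What's the weather like today?"
expected_behaviors:
  - "hope:gate skill does not activate"
  - "Response answers the question directly, with no gating checklist"
```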

6. Test Coverage: Missing founder/career Tests

Location: .github/workflows/eval.yml:56

CI workflow loads founder and career plugins but no test cases exist for them.

Fix: Either add test cases or remove from --plugin-dir flags.

7. Security: Potential Log Exposure

Location: .github/workflows/eval.yml:61-66

The full RESULT output is echoed to CI logs. If Claude response contains sensitive repo info, it will be exposed.

Mitigation: Sanitize output or only log verdict line.
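One way to log only the verdict line, sketched as a helper (the MISSING fallback is illustrative):

```shell
# Sketch: surface only the first verdict line, never the full model output
log_verdict_only() {
  echo "$1" | grep -m 1 "VERDICT:" || echo "VERDICT: MISSING"
}
```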

8. Performance: Timeout May Be Too Short

Location: .github/workflows/eval.yml:57

timeout_minutes: 5 might be tight for large repos or slow runners.

Recommendation: Increase to 10 minutes.

9. Documentation: Missing Failure Triage Guide

Location: docs/dev/evaluations.md:82-93

Troubleshooting section does not explain how to debug failed CI tests locally.

Recommendation: Add Debugging Failed Tests section with step-by-step instructions.

Summary

Code Quality: 4/5
Test Coverage: 3/5 (Missing negative cases and 2 plugins)
Documentation: 5/5 (Excellent!)
Security: 4/5 (Minor log exposure risk)
Performance: 4/5 (Timeout could be longer)

Recommended Actions

Must Fix:

  1. Update test case dates from 2024 to 2025
  2. Fix shell script example or add fuzzy matching
  3. Make reasoning required in schema.json

Should Fix:
4. Add plugin validation step to CI
5. Add test cases for founder/career or remove from workflow
6. Increase timeout to 10 minutes
7. Add negative test cases

Nice to Have:
8. Add failure triage guide to docs
9. Collect test artifacts in CI
10. Sanitize CI log output

Overall Assessment

This is a high-quality PR that introduces a well-thought-out evaluation system. The self-evaluation approach is clever and avoids external dependencies. The documentation is exemplary.

With the fixes above (especially the date correction and schema update), this is ready to merge.

Verdict: APPROVE with minor changes requested

Follow-up commit notes:

- Remove `timeout_minutes` input (not valid for claude-code-action)
- Add `show_full_output: true` for better CI debugging
- Add session_id to debug output
- Improve error messages for execution_file issues
@saadshahd saadshahd merged commit b2b44d0 into main Dec 16, 2025
1 of 6 checks passed
@claude

claude bot commented Dec 16, 2025

PR Review: Skill Evaluation Framework

Summary

This PR adds a comprehensive skill evaluation framework to test auto-triggering across the moo.md plugin ecosystem. The implementation is well-architected and follows the project's conventions.


Strengths

Architecture & Design

  • Smart approach: Using Claude to evaluate Claude is elegant and aligns with Claude Code's philosophy
  • Flexible testing: Both CI (via claude-code-action) and local testing (via ./eval/run.sh) are supported
  • Parallel execution: Matrix strategy in CI allows concurrent test runs (.github/workflows/eval.yml:21-29)
  • Proper fail-fast handling: fail-fast: false ensures all tests run even if one fails (.github/workflows/eval.yml:22)

Code Quality

  • Clean shell script: Proper error handling with set -euo pipefail (eval/run.sh:3)
  • Good UX: Help text and clear usage examples (eval/run.sh:39-52)
  • Documentation: Three levels of docs (README, dev guide, inline comments) provide good coverage

Conventions Adherence

  • CHANGELOG.md updated: Correctly added under [Unreleased] with proper category (CHANGELOG.md:12-15)
  • CLAUDE.md updated: Added eval instructions in appropriate section (CLAUDE.md:41-46)
  • Consistent naming: Test cases use kebab-case as per project conventions

Issues & Recommendations

1. Security: Secrets Exposure Risk

Severity: HIGH

The workflow uses show_full_output: true (.github/workflows/eval.yml:57), which logs the entire Claude output. If Claude accidentally includes sensitive data (API keys, tokens, file contents with secrets), it will appear in CI logs.

Recommendation:

# Remove or set to false
show_full_output: false  # Only show summary

If full output is needed for debugging, add a comment explaining why and ensure skills never process sensitive data.


2. Error Handling: Missing Claude CLI Check

Severity: MEDIUM

The local test script (eval/run.sh) assumes claude CLI is installed but never checks. If missing, users get cryptic 'command not found' errors.

Recommendation (eval/run.sh:5-8):

# Add after SCRIPT_DIR definition
if ! command -v claude &> /dev/null; then
  echo 'Error: claude CLI not found. Install Claude Code first.' >&2
  exit 1
fi

3. Bug: Test Case Schema Mismatch

Severity: MEDIUM

Test YAML files include fields that are not used anywhere:

  • model_tier, timeout_ms (hope-gate-completion.yaml:11-12)
  • evaluators, tags, created (hope-gate-completion.yaml:14-26)

The CI workflow only reads prompt and expected_behaviors (.github/workflows/eval.yml:44-47). This creates maintenance confusion.

Options:

  1. Remove unused fields from YAML files (cleaner)
  2. Document in docs/dev/evaluations.md which fields are reserved for future use
  3. Use the fields (e.g., pass timeout_ms to claude-code-action if supported)

4. Robustness: Partial CI Failures Not Handled

Severity: MEDIUM

The CI exits 0 on VERDICT: PARTIAL but this masks regressions where skills trigger but miss behaviors.

Current (.github/workflows/eval.yml:94-95):

elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "$TEST_NAME: PARTIAL"  # Exit code 0

Recommendation:

elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "::warning::$TEST_NAME returned PARTIAL verdict"
  # Optionally: exit 1 to fail the build on partial

Consider adding a workflow input to toggle strict mode (fail on PARTIAL).
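Such a toggle could be sketched as follows (STRICT is a hypothetical workflow input, default off):

```shell
# Sketch: verdict handling with an opt-in strict mode
evaluate_verdict() {
  local result="$1" strict="${2:-0}"
  if echo "$result" | grep -q "VERDICT: PASS"; then
    return 0
  elif echo "$result" | grep -q "VERDICT: PARTIAL"; then
    if [ "$strict" = "1" ]; then
      return 1    # strict mode: PARTIAL fails the job
    fi
    return 0      # default: PARTIAL passes (warning only)
  fi
  return 1        # FAIL or missing verdict
}
```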


5. Performance: Serial Test Execution Locally

Severity: LOW

eval/run.sh runs tests sequentially (eval/run.sh:76-78). With 5+ tests, this is slow.

Recommendation:
Add a --parallel flag using GNU parallel or background jobs.
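The background-jobs variant could be sketched like this (a minimal runner, not the actual eval/run.sh implementation):

```shell
# Sketch of a --parallel mode: run each command in the background,
# wait for all, and fail if any job failed
run_parallel() {
  local pids=() fail=0 cmd pid
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || fail=1
  done
  return "$fail"
}
```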


6. Code Quality: Hardcoded Plugin List

Severity: LOW

Plugin directories are hardcoded in two places:

  • CI workflow (.github/workflows/eval.yml:56)
  • Docs (docs/dev/evaluations.md:32)

If a new plugin is added (e.g., fitness/), developers must remember to update both.

Recommendation:
Generate plugin list dynamically from .claude-plugin/marketplace.json to ensure the plugin list stays in sync.
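Assuming jq is available and marketplace.json exposes a plugins array with a source path per entry (the field name is an assumption about the manifest shape), the flags could be derived like this:

```shell
# Sketch: build --plugin-dir flags from a marketplace manifest
# Assumes each entry in .plugins has a "source" path (hypothetical field name)
plugin_flags() {
  jq -r '.plugins[].source | "--plugin-dir ./\(.)"' "$1" | tr '\n' ' '
}
```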


7. Documentation: Missing Failure Examples

Severity: LOW

README shows successful output but not what failures look like. Developers won't know if FAIL/PARTIAL is expected for work-in-progress skills.

Add to eval/README.md:
Example output for both passing and failing tests.


8. Best Practice: Verdict Pattern Too Loose

Severity: LOW

Verdict detection uses grep -q "VERDICT: PASS" (.github/workflows/eval.yml:92), which matches anywhere in output. If Claude explains 'previous VERDICT: PASS was wrong', the test incorrectly passes.

Recommendation:
Match only at end of output:

if tail -n 5 "$EXECUTION_FILE" | grep -q "^VERDICT: PASS$"; then

Minor Nits

  1. Inconsistent flag naming: Workflow uses --plugin-dir but docs mention it is for 'adding plugins' (eval/run.sh does not use it). Clarify in docs.

  2. Typo in workflow: Line 100 says 'The output may not contain the expected VERDICT line' but should suggest checking the execution file path (.github/workflows/eval.yml:100-101).

  3. JSON schema unused locally: The schema is only used in full mode (without --simple), but docs do not explain when or why to use full mode (eval/README.md:20-22).


Test Coverage Assessment

Current tests cover:

  • ✅ 3 hope skills (gate, soul, trace)
  • ✅ 1 product skill
  • ✅ 1 wordsmith skill
  • ❌ 0 founder skills
  • ❌ 0 career skills

Recommendation: Add at least one test for founder and career to match the workflow's path triggers (.github/workflows/eval.yml:9-10).


Security Checklist

  • ✅ No secrets in code
  • ⚠️ show_full_output: true could leak secrets (see issue 1 above)
  • ✅ Workflow permissions minimal (contents: read, id-token: write)
  • ✅ No arbitrary code execution from YAML
  • ✅ Shell script uses safe variable expansion

Final Verdict

Recommendation: Approve with changes

This is solid work that adds real value. The architecture is sound and the documentation is thorough. Address the HIGH severity issue (#1) before merging, and consider the MEDIUM severity items for follow-up PRs.

Must-Fix Before Merge

  1. Remove or document show_full_output: true to prevent secret leaks

Should-Fix (Before or After Merge)

  1. Add Claude CLI installation check
  2. Clean up or document unused YAML fields
  3. Decide on PARTIAL verdict handling

Nice-to-Have

5-8. Performance, maintenance, and UX improvements

Great job on this foundation! The eval framework will make maintaining plugin quality much easier.


Confidence: 85% - Thorough review of all files, tested shell script logic mentally, but have not run the CI workflow end-to-end.
Key Assumption: The claude-code-action@v1 action's show_full_output does not redact secrets automatically.
Reversible: All issues have low-risk fixes that will not break existing functionality.
